Nginx log analysis with Pandas


In [19]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re

%matplotlib inline

Data import - here be regex

The log format I use for nginx:

'$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'$request_time $upstream_response_time $pipe';

The following regex will match


In [20]:
pattern = re.compile(r"""(?P<remote_addr>.+?) \s-\s
            (?P<remote_user>.+?) \s
            \[(?P<time_local>.+?)\] \s 
            "(?P<method>.+?) \s (?P<uri>.+?) \s (?P<protocol>.+?)" \s
            (?P<status>.+?) \s
            (?P<body_bytes_sent>.+?) \s
            "(?P<http_referer>.+?)"? \s # Optional final " as I initially missed this when setting the format
            "(?P<http_user_agent>.*?)" \s
            (?P<request_time>.+?) \s
            (?P<upstream_response_time>.+?) \s
            (?P<pipe>.+?)""", re.VERBOSE)

In [21]:
test = """195.211.155.181 - - [18/Jan/2015:22:43:05 +0000] "GET /blog/email-templates-using-zend-framework/ HTTP/1.1" """ + \
        """200 7408 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 6.0; """ + \
        """Windows NT 5.1; SV1) )" 0.039 0.039 ."""

m = re.match(pattern, test)
m.groupdict()


Out[21]:
{'body_bytes_sent': '7408',
 'http_referer': '-',
 'http_user_agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) )',
 'method': 'GET',
 'pipe': '.',
 'protocol': 'HTTP/1.1',
 'remote_addr': '195.211.155.181',
 'remote_user': '-',
 'request_time': '0.039',
 'status': '200',
 'time_local': '18/Jan/2015:22:43:05 +0000',
 'upstream_response_time': '0.039',
 'uri': '/blog/email-templates-using-zend-framework/'}

With our working regex pattern we can parse every line in the log file and store the data in a list of dicts


In [22]:
log_file = '/home/jonathan/dev/access.log'

with open(log_file) as f:
    log_list = []
    for i in f.readlines():
        m = re.match(pattern, i)
        log_list.append(m.groupdict())

Converting the data to a pandas DataFrame only requires passing the list to pd.DataFrame


In [23]:
log = pd.DataFrame(log_list)

print(log.head(1))


  body_bytes_sent http_referer  \
0            2355            -   

                                     http_user_agent method pipe  protocol  \
0  Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31....    GET    .  HTTP/1.1   

     remote_addr remote_user request_time status                  time_local  \
0  173.79.236.46           -        0.005    200  29/Jul/2014:02:32:01 +0100   

  upstream_response_time uri  
0                  0.005   /  

Inspecting the dataframe


In [24]:
log.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 335085 entries, 0 to 335084
Data columns (total 13 columns):
body_bytes_sent           335085 non-null object
http_referer              335085 non-null object
http_user_agent           335085 non-null object
method                    335085 non-null object
pipe                      335085 non-null object
protocol                  335085 non-null object
remote_addr               335085 non-null object
remote_user               335085 non-null object
request_time              335085 non-null object
status                    335085 non-null object
time_local                335085 non-null object
upstream_response_time    335085 non-null object
uri                       335085 non-null object
dtypes: object(13)
memory usage: 35.8+ MB

Currently, all columns are object type. The data will be easier to work with if the types are converted where appropriate

  • time_local -> datetime
  • upstream_response_time -> float
  • request_time -> float
  • body_bytes_sent -> integer

status is a numerical code so could be converted to an integer but as this is a categorical variable we probably don't need to convert it


In [25]:
log['time_local'] = pd.to_datetime(log['time_local'].apply(lambda x : x.split(' ')[0]), format='%d/%b/%Y:%H:%M:%S')

log.loc[log['upstream_response_time'] == '-', 'upstream_response_time'] = np.nan
log[['upstream_response_time', 'request_time']] = log[['upstream_response_time', 'request_time']].astype(np.float64)
log['body_bytes_sent'] = log['body_bytes_sent'].astype(np.int64)

log.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 335085 entries, 0 to 335084
Data columns (total 13 columns):
body_bytes_sent           335085 non-null int64
http_referer              335085 non-null object
http_user_agent           335085 non-null object
method                    335085 non-null object
pipe                      335085 non-null object
protocol                  335085 non-null object
remote_addr               335085 non-null object
remote_user               335085 non-null object
request_time              335085 non-null float64
status                    335085 non-null object
time_local                335085 non-null datetime64[ns]
upstream_response_time    294004 non-null float64
uri                       335085 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(9)
memory usage: 35.8+ MB

The index is currently the default, an incrementing number (in this case 0 to 335084).

Setting the index to the time of the request will simplify analysis later.


In [26]:
log.index = log.pop('time_local')
log.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 335085 entries, 2014-07-29 02:32:01 to 2015-01-18 22:03:33
Data columns (total 12 columns):
body_bytes_sent           335085 non-null int64
http_referer              335085 non-null object
http_user_agent           335085 non-null object
method                    335085 non-null object
pipe                      335085 non-null object
protocol                  335085 non-null object
remote_addr               335085 non-null object
remote_user               335085 non-null object
request_time              335085 non-null float64
status                    335085 non-null object
upstream_response_time    294004 non-null float64
uri                       335085 non-null object
dtypes: float64(2), int64(1), object(9)
memory usage: 33.2+ MB

Basic summary information is available with the describe method


In [27]:
log.describe()


Out[27]:
body_bytes_sent request_time upstream_response_time
count 335085.000000 335085.000000 294004.000000
mean 50860.877577 0.505884 0.071010
std 95728.088827 2.068734 0.077905
min 0.000000 0.000000 0.000000
25% 5786.000000 0.038000 0.037000
50% 13548.000000 0.152000 0.043000
75% 24858.000000 0.460000 0.086000
max 3943976.000000 209.385000 3.760000

Some of the requests take a long time. Let's take a closer look

Response time analysis


In [28]:
mask = log['upstream_response_time'] > 1.0
log[mask].shape


Out[28]:
(24, 12)

In [29]:
selected_columns = ['request_time', 'status', 'upstream_response_time', 'uri']
log.loc[mask, selected_columns].sort('upstream_response_time', ascending=False)


Out[29]:
request_time status upstream_response_time uri
time_local
2014-08-22 10:37:23 3.760 200 3.760 /admin/blog/comment/?spam__isnull=True
2014-08-22 22:26:33 2.703 200 2.703 /admin/blog/comment/?
2014-08-26 01:00:09 2.265 200 2.004 /blog/atom/
2014-08-05 13:40:57 2.701 200 1.789 /blog/ai-cookbook-competition/
2014-09-25 15:30:48 1.861 200 1.450 /blog/bcne3-search-phpbb-with-sphinx/
2014-08-22 00:43:35 1.433 200 1.433 /admin/blog/comment/
2014-08-22 15:21:26 1.896 200 1.419 /blog/atom/
2014-08-26 19:20:03 2.207 200 1.415 /blog/tag/python/atom/
2014-09-02 00:13:06 1.326 200 1.326 /blog/ai-cookbook-competition/
2014-08-08 14:57:27 1.714 200 1.238 /blog/tag/python/atom/
2014-08-22 13:33:26 1.472 200 1.208 /blog/atom/
2014-08-13 09:10:04 1.467 200 1.152 /blog/atom/1/
2014-08-22 13:33:15 1.116 200 1.116 /blog/ai-cookbook-competition/
2014-08-05 13:57:32 2.023 200 1.106 /blog/atom/
2014-08-11 10:00:02 1.125 200 1.092 /blog/atom/
2014-08-13 08:00:03 1.418 200 1.083 /blog/atom/1/
2014-08-04 04:20:02 1.304 200 1.061 /blog/atom/
2014-08-22 14:40:06 1.314 200 1.050 /blog/atom/
2014-08-29 09:40:02 1.083 200 1.039 /blog/atom/
2014-08-28 12:20:03 1.276 200 1.021 /blog/atom/
2014-08-04 04:22:18 1.282 200 1.015 /blog/atom/
2014-08-21 17:47:19 1.091 200 1.009 /blog/bcne3-search-phpbb-with-sphinx/
2014-08-05 13:48:25 1.380 200 1.007 /blog/atom/1/
2015-01-15 07:39:28 1.263 200 1.007 /blog/atom/

The slowest requests are on the back end but a couple of blog posts and the atom feed also appear to be slow. Are they always slow or were these requests anomalous?


In [30]:
mask = log['uri'] == '/blog/atom/'
log[mask].describe()


Out[30]:
body_bytes_sent request_time upstream_response_time
count 20331.000000 20331.000000 20329.000000
mean 281056.742610 0.843777 0.249729
std 21001.686369 1.342209 0.053137
min 0.000000 0.000000 0.216000
25% 282935.000000 0.481000 0.227000
50% 282935.000000 0.550000 0.234000
75% 282935.000000 0.692000 0.253000
max 283197.000000 66.848000 2.004000

In [31]:
np.percentile(log.loc[log['uri'] == '/blog/atom/', 'upstream_response_time'], 95)


Out[31]:
0.31900000000000001

Traffic analysis

Daily requests


In [32]:
log_freq = log['body_bytes_sent'].resample('D', how='count')

In [33]:
log_freq.plot()


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa1b065f0b8>

Top URIs


In [34]:
top_uri = log.groupby('uri').size()
top_uri.sort()
top_uri.tail(10)


Out[34]:
uri
/blog/disk-space-database-table-mysql-postgresql/      5902
/static/js/bootstrap.min.js                            6137
/static/css/bootstrap.css                              6424
/blog/atom/1                                          10395
/blog/atom/1/                                         11856
/blog/ai-cookbook-competition/                        17941
/blog/tag/python/atom/                                19729
/blog/atom/                                           20331
/                                                     20336
/blog/bcne3-search-phpbb-with-sphinx/                110334
dtype: int64

In [35]:
uri_mask = log['uri'] == '/blog/bcne3-search-phpbb-with-sphinx/'
selected_columns = ['remote_addr', 'method', 'http_user_agent', 'http_referer']
log.loc[mask, selected_columns].describe()


Out[35]:
remote_addr method http_user_agent http_referer
count 20331 20331 20331 20331
unique 401 2 114 11
top 72.196.9.247 GET Mozilla/5.0 (compatible; Googlebot/2.1; +http:... -
freq 1596 20243 8339 20301

In [36]:
uri_mask = log['uri'] == '/blog/bcne3-search-phpbb-with-sphinx/'
method_mask = log['method'] == 'POST'
mask = uri_mask & method_mask

selected_columns = ['remote_addr', 'method', 'http_user_agent', 'http_referer']

log.loc[mask, selected_columns].describe()


Out[36]:
remote_addr method http_user_agent http_referer
count 45414 45414 45414 45414
unique 3251 1 271 2
top 66.117.8.34 POST Mozilla/4.0 (compatible; MSIE 6.0; Windows NT ... http://jonathanstreet.com/blog/bcne3-search-ph...
freq 3351 45414 5413 45411